EDA Kaggle Example
if(!require(xda)){devtools::install_github("ujjwalkarn/xda")}
if(!require(easypackages)){install.packages("easypackages")}
library(easypackages)
packages("data.table", "dplyr", "MASS", "tidyr","ggplot2", "fpc", "plotly", "caret", "glmnet", "ranger", "e1071", "clValid",
"xda", "gridExtra", "corrplot", prompt = FALSE)train <- fread("../data/Kaggle_Halloween_train.csv")
test <- fread("../data/Kaggle_Halloween_test.csv")
#Add column
train$Dataset <- "train"
test$Dataset <- "test"
full <- bind_rows(train, test)
Introduction
This exercise is based on a Kaggle Competition.
900 ghouls, ghosts, and goblins are infesting the halls of Valorem and frightening our data scientists. It became clear that machine learning is the only answer to banishing our unwanted guests.
371 of the ghastly creatures have been identified, but your help is needed to vanquish the rest. Only an accurate classification algorithm can thwart them. Use bone length measurements, severity of rot, extent of soullessness, and other characteristics to distinguish (and extinguish) the intruders.
File descriptions
- Kaggle_Halloween_train.csv - the training set
- Kaggle_Halloween_test.csv - the test set
Data fields
- id - id of the creature
- bone_length - average length of bone in the creature, normalized between 0 and 1
- rotting_flesh - percentage of rotting flesh in the creature
- hair_length - average hair length, normalized between 0 and 1
- has_soul - percentage of soul in the creature
- color - dominant color of the creature: ‘white’,‘black’,‘clear’,‘blue’,‘green’,‘blood’
- type - target variable: ‘Ghost’, ‘Goblin’, and ‘Ghoul’
Data - Quick Look
str(train)
## Classes 'data.table' and 'data.frame': 371 obs. of 8 variables:
## $ id : int 0 1 2 4 5 7 8 11 12 19 ...
## $ bone_length : num 0.355 0.576 0.468 0.777 0.566 ...
## $ rotting_flesh: num 0.351 0.426 0.354 0.509 0.876 ...
## $ hair_length : num 0.466 0.531 0.812 0.637 0.419 ...
## $ has_soul : num 0.781 0.44 0.791 0.884 0.636 ...
## $ color : chr "clear" "green" "black" "black" ...
## $ type : chr "Ghoul" "Goblin" "Ghoul" "Ghoul" ...
## $ Dataset : chr "train" "train" "train" "train" ...
## - attr(*, ".internal.selfref")=<externalptr>
numSummary(train)
## n mean sd max min range nunique nzeros
## id 371 443.677 263.222 897.000 0.0000 897.000 371 1
## bone_length 371 0.434 0.133 0.817 0.0610 0.756 371 0
## rotting_flesh 371 0.507 0.146 0.932 0.0957 0.837 371 0
## hair_length 371 0.529 0.170 1.000 0.1346 0.865 371 0
## has_soul 371 0.471 0.176 0.936 0.0094 0.926 371 0
## iqr lowerbound upperbound noutlier kurtosis skewness
## id 473.500 -504.7500 1388.750 0 -1.259 -0.0426
## bone_length 0.177 0.0738 0.783 2 -0.212 0.0549
## rotting_flesh 0.190 0.1296 0.889 3 -0.105 0.0540
## hair_length 0.241 0.0456 1.009 0 -0.479 -0.0153
## has_soul 0.253 -0.0316 0.980 0 -0.352 -0.0342
## mode miss miss% 1% 5% 25% 50% 75%
## id 0.000 0 0 4.7000 33.000 205.500 458.000 678.500
## bone_length 0.355 0 0 0.1638 0.208 0.340 0.435 0.517
## rotting_flesh 0.351 0 0 0.1775 0.256 0.415 0.502 0.604
## hair_length 0.466 0 0 0.1642 0.241 0.407 0.539 0.647
## has_soul 0.781 0 0 0.0618 0.180 0.348 0.466 0.601
## 95% 99%
## id 844.000 886.900
## bone_length 0.652 0.756
## rotting_flesh 0.744 0.840
## hair_length 0.801 0.902
## has_soul 0.768 0.855
charSummary(train)
## n miss miss% unique
## color 371 0 0 6
## type 371 0 0 3
## Dataset 371 0 0 1
## top5levels:count
## color white:137, clear:120, green:42, black:41, blue:19
## type Ghoul:129, Goblin:125, Ghost:117
## Dataset train:371
Good, no missing data.
Data Engineering
Characters to Factors
Change character strings to factors.
head(select_if(full, is.character))
## color type Dataset
## 1: clear Ghoul train
## 2: green Goblin train
## 3: black Ghoul train
## 4: black Ghoul train
## 5: green Ghost train
## 6: green Goblin train
# Convert characters to factors - use full so all data is treated the same
full <- full %>% mutate_if(is.character, as.factor)
str(full)
## 'data.frame': 900 obs. of 8 variables:
## $ id : int 0 1 2 4 5 7 8 11 12 19 ...
## $ bone_length : num 0.355 0.576 0.468 0.777 0.566 ...
## $ rotting_flesh: num 0.351 0.426 0.354 0.509 0.876 ...
## $ hair_length : num 0.466 0.531 0.812 0.637 0.419 ...
## $ has_soul : num 0.781 0.44 0.791 0.884 0.636 ...
## $ color : Factor w/ 6 levels "black","blood",..: 4 5 1 1 5 5 6 4 3 6 ...
## $ type : Factor w/ 3 levels "Ghost","Ghoul",..: 2 3 2 2 1 3 3 2 1 1 ...
## $ Dataset : Factor w/ 2 levels "test","train": 2 2 2 2 2 2 2 2 2 2 ...
color, type, and Dataset are now factors.
Data Visualizations
Plots by Creature Type
Create boxplots to see the differences by creature.
dataViz1 <- full %>% filter(Dataset == "train") %>% ggplot(aes(x = type, y = bone_length, fill = type)) + geom_boxplot() + xlab("Creature") +
ylab("Bone Length") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))
dataViz2 <- full %>% filter(Dataset == "train") %>% ggplot(aes(x = type, y = rotting_flesh, fill = type)) + geom_boxplot() + xlab("Creature") +
ylab("Percentage of Rotting Flesh") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))
dataViz3 <- full %>% filter(Dataset == "train") %>% ggplot(aes(x = type, y = hair_length, fill = type)) + geom_boxplot() + xlab("Creature") +
ylab("Hair Length") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))
dataViz4 <- full %>% filter(Dataset == "train") %>% ggplot(aes(x = type, y = has_soul, fill = type)) + geom_boxplot() + xlab("Creature") +
ylab("Percentage of Soul Present") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))
grid.arrange(dataViz1, dataViz2, dataViz3, dataViz4, ncol = 2)
A different view of the data:
p1 <- plot_ly(train, x = ~bone_length, y = ~rotting_flesh, z = ~has_soul, type = "scatter3d", mode = "markers", color = ~type)
p1
Plots by Color Distribution
Compare the use of color amongst the creatures.
ghost_color <- full %>% filter(Dataset == "train") %>% filter(type == 'Ghost') %>% group_by(color) %>% summarise(count = n())
dataViz5 <- ggplot(ghost_color, aes(x = color, y = count, fill = color)) + geom_bar(stat = "identity") +
xlab("Color") + ylab("Number of Observations") + ggtitle("Ghost Colors") + scale_fill_manual(values = c("Black", "#D55E00", "#0072B2", "#F0E442", "#009E73", "#999999")) +
theme(panel.grid.minor = element_blank()) + ylim(0, 50)
ghost_color <- full %>% filter(Dataset == "train") %>% filter(type == 'Ghoul') %>% group_by(color) %>% summarise(count = n())
dataViz6 <- ggplot(ghost_color, aes(x = color, y = count, fill = color)) + geom_bar(stat = "identity") +
xlab("Color") + ylab("Number of Observations") + ggtitle("Ghoul Colors") + scale_fill_manual(values = c("Black", "#D55E00", "#0072B2", "#F0E442", "#009E73", "#999999")) +
theme(panel.grid.minor = element_blank()) + ylim(0, 50)
ghost_color <- full %>% filter(Dataset == "train") %>% filter(type == 'Goblin') %>% group_by(color) %>% summarise(count = n())
dataViz7 <- ggplot(ghost_color, aes(x = color, y = count, fill = color)) + geom_bar(stat = "identity") +
xlab("Color") + ylab("Number of Observations") + ggtitle("Goblin Colors") + scale_fill_manual(values = c("Black", "#D55E00", "#0072B2", "#F0E442", "#009E73", "#999999")) +
theme(panel.grid.minor = element_blank()) + ylim(0, 50)
grid.arrange(dataViz5, dataViz6, dataViz7, ncol = 2)
It appears ghosts have shorter hair and less soul than ghouls and goblins, but the creatures are otherwise pretty close. Ghouls and goblins are going to be tricky to distinguish. Color doesn't appear to help much, as the distribution is fairly even across these multi-colored creatures. (This variable will likely be removed before modeling.)
Examine whether there are any obvious correlations.
pairs(full[, 2:5], col = full$type, labels = c("Bone Length", "Rotting Flesh", "Hair Length", "Soul"))
The pairs plot above is not too helpful. Try a more scholarly approach below.
train_correlation <- train %>% select(bone_length:has_soul)
train_correlation <- cor(train_correlation)
# corrplot(train_correlation, method="circle")
# data
corrplot::corrplot.mixed(train_correlation) # cor(train_correlation)
No strong evidence of correlation - the largest value is close to 0.5. Perhaps we can take advantage of a combination of characteristics that do show some promise, most notably "Hair Length" and "Soul". Do we get better separation among creatures if we combine these variables into one? By multiplying variables together we may obtain better features to distinguish the classes.
separation1 <- full %>% mutate(hair_soul = hair_length * has_soul) %>% filter(!is.na(type))
ggplot(separation1, aes(x = type, y = hair_soul, fill = type)) + geom_boxplot() +
xlab("Creature") + ylab("Combination of Hair/Soul") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))Separation appears greater. Test various other data mutations to determine if this can be improved furthere. This is a process of experimentation.
# Sep1
separation2 <- full %>% filter(!is.na(type)) %>% mutate(sep2 = bone_length * hair_length * has_soul, sep2 = sep2 / max(sep2))
separation3 <- full %>% filter(!is.na(type)) %>% mutate(allfeatures = ((bone_length^2) * (hair_length^4) * (has_soul^4))/rotting_flesh)
separation4 <- full %>% filter(!is.na(type)) %>% mutate(bone_flesh = bone_length * rotting_flesh, bone_hair = bone_length * hair_length,
bone_soul = bone_length * has_soul, flesh_hair = rotting_flesh * hair_length, flesh_soul = rotting_flesh * has_soul,
hair_soul = hair_length * has_soul)
plotExp2 <- ggplot(separation2, aes(x = type, y = sep2, fill = type)) + geom_boxplot() +
xlab("Creature") + ylab("Combination of Bone-Hair-Soul") + ggtitle("Bone-Hair-Soul") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))
plotExp3 <- ggplot(separation3, aes(x = type, y = allfeatures, fill = type)) + geom_boxplot() +
xlab("Creature") + ylab("Combination of allfeatures") + ggtitle("All Features") + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73"))
# separation4 columns 9 - 14 are the new variable columns
for(i in 9:14){
# aes_string() evaluates the column name immediately, so each plot keeps its own column
# (a bare aes(y = separation4[i]) would be evaluated lazily and every plot would use the last column)
assign(paste0("plotSep", i),
ggplot(separation4, aes_string(x = "type", y = names(separation4)[i], fill = "type")) + geom_boxplot() +
xlab("Creature") + ylab("Combos") + ggtitle(names(separation4)[i]) + scale_fill_manual(values = c("#D55E00", "#0072B2", "#009E73")))
}
grid.arrange(plotSep9, plotSep10, plotSep11, plotSep12, plotSep13, ncol = 2)
Because separation appears to be improved by combining variables, this will likely be used for modeling later in the example.
Clustering data
While clustering is generally used for unsupervised machine learning, take a peek at the clusters that could be formed. The potential issue with trying to cluster this data is that we are working with two types of data: continuous and categorical. They break down like this:
| Continuous Variables | Categorical Variables |
|---|---|
| bone length | id |
| rotting flesh | color |
| hair length | |
| has soul | |
There are only two categorical variables. Because of our small sample size, it’s not a good idea to count out these variables completely, but we’ll try to create clusters without them just to see how well the clustering models do.
Cluster using the kmeans function.
# Set the seed
set.seed(1234)
# Extract creature labels and remove column from dataset
creature_labels <- full$type
full2 <- full
full2$type <- NULL
# Remove categorical variables (id, color, and dataset) from dataset
full2$id <- NULL
full2$color <- NULL
full2$Dataset <- NULL
# Perform k-means clustering with 3 clusters, repeat 30 times
creature_km_1 <- kmeans(full2, 3, nstart = 30)
Look at the clusters graphically first. This plot was created using the plotcluster() function from the fpc package.
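The plot referenced above is not reproduced in the code; a minimal sketch of it, assuming the cluster assignments from creature_km_1, would be:

```r
# Sketch: project the data onto discriminant coordinates, colored by cluster
library(fpc)
plotcluster(full2, creature_km_1$cluster)
```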
The clusters do not look discrete. Consider Dunn’s Index mathematically to see if we are missing something visually. This calculation comes from the dunn function in the clValid package.
dunn_ckm_1 <- dunn(clusters = creature_km_1$cluster, Data = full2)
dunn_ckm_1
## [1] 0.0425
Dunn’s Index represents a ratio of the smallest distance between observations in different clusters to the largest distance between two points in the same cluster (or, the smallest inter-cluster distance over the largest intra-cluster distance). A low number indicates that the clusters are not condensed, separate entities. This is not surprising considering we completely disregarded one of our variables.
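For illustration, the same quantity can be sketched by hand from the pairwise distance matrix (this mirrors clValid's default definition, not the package's actual code):

```r
# Sketch of Dunn's Index: min inter-cluster distance / max intra-cluster distance
d  <- as.matrix(dist(full2))
cl <- creature_km_1$cluster
same_cluster <- outer(cl, cl, "==")
min(d[!same_cluster]) / max(d[same_cluster])
```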
See how well this clustering method correctly separated the labelled creatures.
table(creature_km_1$cluster, creature_labels)
## creature_labels
## Ghost Ghoul Goblin
## 1 3 95 35
## 2 106 5 20
## 3 8 29 70
Ghosts were separated relatively well, but ghouls and goblins are split between the clusters. No new information was identified. But it’s been an interesting exploratory path!
Modeling for Creature Identity
Split the test and training data back into separate datasets. Note: modify full to include the variable interactions created in the separation4 dataset, and remove the id, color, and Dataset variables.
full <- full %>% mutate(bone_flesh = bone_length * rotting_flesh, bone_hair = bone_length * hair_length,
bone_soul = bone_length * has_soul, flesh_hair = rotting_flesh * hair_length, flesh_soul = rotting_flesh * has_soul,
hair_soul = hair_length * has_soul)
train_complete <- filter(full, Dataset == "train")
train_complete <- select(train_complete, c(-id, -color, -Dataset))
test_complete <- filter(full, Dataset == "test")
test_complete <- select(test_complete, c(-id, -color, -Dataset))
Because we are using the caret package, generate a standard trainControl so tuning parameters remain consistent throughout the testing and experimentation.
Creating trainControl
Create a control that performs 10 repeats of a 6-fold cross-validation of the data.
myControl <- trainControl(method = "repeatedcv", number = 6, repeats = 10, verboseIter = TRUE) # repeatedcv vs cv - test and compare
Random Forest Modeling
Start with a random forest model, generated using the ranger and caret packages. Include all of the original variables, including any interactions.
set.seed(1234)
modelrandomforest <- train(type~., tuneLength = 3, data = train_complete, method = "ranger", trControl = myControl, importance = 'impurity')
Examine the levels of importance of each factor in this model.
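The importance values themselves are not printed above; one way to sketch them with caret, assuming the fitted modelrandomforest object, is:

```r
# Sketch: impurity-based variable importance from the ranger model
rfImportance <- varImp(modelrandomforest)
plot(rfImportance)
```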
The hair_soul variable seems to be the most important in this model, and our other interactions rank pretty highly.
GLMnet Modeling
A generalized linear model (GLM) is a flexible generalization of ordinary linear regression that allows response variables with error distributions other than the normal distribution. The glmnet implementation fits penalized GLMs: alpha mixes the ridge (alpha = 0) and lasso (alpha = 1) penalties, and lambda controls the penalty strength.
Test a glmnet model, also from caret.
set.seed(1234)
modelglm <- train(type~., method = "glmnet", tuneGrid = expand.grid(alpha = 0:1, lambda = seq(0.0001, 1, length = 20)),
data = train_complete, trControl = myControl)
Without going into depth, a few more algorithms are evaluated.
set.seed(1234)
# train LVQ model (Learning Vector Quantization)
modelLvq <- train(type~., data = train_complete, method="lvq", trControl = myControl)
# train Support Vector Machine model
modelSvm <- train(type~., data = train_complete, method="svmRadial", trControl = myControl)
# decision tree
modeltree <- train(type~., data = train_complete, method="rpart", trControl = myControl)
# Tree + PCA
modeltreepca <- train(type~., data = train_complete, method="rpart", trControl = myControl, preProcess = "pca", parms = list(split='information'))
# KNN
modelknn <- train(type~., data = train_complete, method="knn", trControl = myControl)
# Naive Bayes
modelbayes <- train(type~., data = train_complete, method="nb", trControl = myControl)
Before moving on, here is a last decision tree algorithm provided in a visual format:
library("rpart.plot")
fit2 <- rpart(type~., method = "class", data = train_complete, control = rpart.control(minsplit = 50), parms = list(split='information'))
rpart.plot(fit2, type = 2, extra = 1)
Comparing Models
Compare the results of all the models.
# Create a list of models
models <- list(Random_Forest = modelrandomforest, GLM = modelglm, LVQ = modelLvq, SVM = modelSvm, DecisionTree = modeltree,
DecisionTree_PCD = modeltreepca, NaiveBayes = modelbayes, KNN = modelknn)
# Resample the models
resampled <- resamples(models)
# Generate a summary
summary(resampled)
##
## Call:
## summary.resamples(object = resampled)
##
## Models: Random_Forest, GLM, LVQ, SVM, DecisionTree, DecisionTree_PCD, NaiveBayes, KNN
## Number of resamples: 60
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NaiveBayes 0.629 0.694 0.726 0.734 0.764 0.871 0
## Random_Forest 0.619 0.687 0.742 0.733 0.774 0.873 0
## GLM 0.617 0.726 0.758 0.758 0.806 0.873 0
## LVQ 0.597 0.676 0.718 0.719 0.762 0.850 0
## SVM 0.583 0.672 0.710 0.706 0.738 0.836 0
## DecisionTree 0.541 0.602 0.626 0.630 0.661 0.733 0
## DecisionTree_PCD 0.516 0.661 0.702 0.695 0.738 0.806 0
## KNN 0.623 0.710 0.742 0.742 0.774 0.839 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## NaiveBayes 0.441 0.541 0.588 0.601 0.645 0.806 0
## Random_Forest 0.430 0.530 0.613 0.599 0.661 0.810 0
## GLM 0.425 0.589 0.638 0.637 0.709 0.809 0
## LVQ 0.396 0.514 0.577 0.578 0.642 0.775 0
## SVM 0.371 0.508 0.564 0.557 0.606 0.754 0
## DecisionTree 0.311 0.400 0.441 0.445 0.493 0.600 0
## DecisionTree_PCD 0.262 0.491 0.551 0.541 0.605 0.709 0
## KNN 0.434 0.564 0.613 0.613 0.661 0.757 0
# Plot the differences between model fits
dotplot(resampled, metric = "Accuracy")
Predicting Creature Identity
The glmnet model provides the highest accuracy, so use that model to predict creature types in the test set.
# Predict creature types for the test set
my_prediction <- predict(modelglm, test_complete)
The first ten predictions are: Ghoul, Goblin, Ghoul, Ghost, Ghost, Ghost, Ghoul, Ghoul, Goblin, Ghoul
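To turn these predictions into a Kaggle submission, something like the following sketch would work (the output filename and required columns are assumptions about the competition format):

```r
# Sketch: pair predictions with the test ids and write a submission file
submission <- data.frame(id = filter(full, Dataset == "test")$id, type = my_prediction)
write.csv(submission, "submission.csv", row.names = FALSE)
```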
Addendum
Learning Vector Quantization (LVQ)
(Added because it was recently learned.) Learning vector quantization (LVQ) is a type of artificial neural network and, more broadly, a form of computational intelligence. The algorithm takes a competitive, winner-takes-all approach to learning and is related to other neural network algorithms like the perceptron and back-propagation. LVQ lets you choose the number of prototype (codebook) vectors to train and then learns what those prototypes should look like from the data. LVQ is closely related to the k-nearest neighbors algorithm.
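As a toy illustration of the winner-takes-all idea (not the implementation behind caret's "lvq" method), a single LVQ1 update step might look like:

```r
# Toy LVQ1 update: pull the winning prototype toward a training example when
# their labels match, push it away when they do not
lvq1_update <- function(protos, proto_labels, x, x_label, lr = 0.05) {
  dists <- rowSums(sweep(protos, 2, x)^2)  # squared distance to each prototype
  w <- which.min(dists)                    # winner-takes-all
  direction <- if (proto_labels[w] == x_label) 1 else -1
  protos[w, ] <- protos[w, ] + direction * lr * (x - protos[w, ])
  protos
}
```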